Abstract

This paper proposes a new algorithm for multiple sparse regression in high dimensions, where the 
task is to estimate the support and values of several (typically related) sparse vectors from a few noisy 
linear measurements. Our algorithm is a "forward-backward" greedy procedure that - uniquely - operates 
on two distinct classes of objects. In particular, we organize our target sparse vectors as a matrix; our 
algorithm involves iterative addition and removal of both (a) individual elements, and (b) entire rows 
(corresponding to shared features), of the matrix. 

Analytically, we establish that our algorithm manages to recover the supports (exactly) and values 
(approximately) of the sparse vectors, under assumptions similar to existing approaches based on convex 
optimization. However, our algorithm has a much smaller computational complexity. Perhaps most 
interestingly, it is seen empirically to require visibly fewer samples. Ours represents the first attempt to
extend greedy algorithms to the class of models that can only, or best, be represented by a combination of
component structural assumptions (sparse and group-sparse, in our case).



1 Introduction 

This paper provides a new algorithm for the (standard) multiple sparse linear regression problem,
which we now describe. We are interested in inferring r sparse vectors $\beta^{*(1)}, \ldots, \beta^{*(r)} \in \mathbb{R}^p$ from noisy linear
measurements; in particular, for each $1 \le j \le r$, we observe $n_j$ noisy linear measurements according to the
statistical model

$$y^{(j)} = X^{(j)} \beta^{*(j)} + z^{(j)} \qquad \forall j \in \{1, \ldots, r\}, \tag{1}$$

where for each j, $X^{(j)} \in \mathbb{R}^{n_j \times p}$ is the design matrix, $y^{(j)} \in \mathbb{R}^{n_j}$ is the response vector and $z^{(j)} \in \mathbb{R}^{n_j}$ is the
noise. We combine all tasks $\beta^{*(j)}$ as columns of a matrix $\beta^* \in \mathbb{R}^{p \times r}$. We are thus interested in inferring the
matrix $\beta^*$ given $(y^{(j)}, X^{(j)})$, for $1 \le j \le r$. Here inference means both recovery of the support of $\beta^*$, as well
as closeness in numerical values on the non-zero elements.

We are interested in solving this problem in the high-dimensional setting, where the number of
observations $n_j$ is potentially substantially smaller than the number of features p. High-dimensional settings
arise in applications where measurements are expensive, and hence a sufficient number may be unavailable.
Consistent recovery of $\beta^*$ is then not possible in general; however, as is now well-recognized, it is possible if
each $\beta^{*(j)}$ is sparse, and the design matrices satisfy certain properties.

Multiple sparse linear regression comes up in applications ranging from graphical model selection [13] and
kernel learning [1] to function estimation [12] and multi-task learning [6], etc. In several of these examples, the
different $\beta^{*(j)}$ vectors are related, in the sense that they share portions of their supports/features, and may
even be close in values on those entries. As an example, consider the task of learning the handwritten character
"A" for different writers. Since all these handwritings read "A", they should share many features, but
there may be a few non-shared features that characterize each individual's handwriting. A natural question in
this setting is: can inferring the vectors jointly (often referred to as multi-task learning [2]) result in lower
sample complexity than inferring each one individually?

When the sharing of supports is partial, it turns out the answer depends on the method used. Some
"group LASSO" methods like $\ell_1/\ell_q$ regularization can actually result in lower or higher sample complexity,
as compared to doing for example separate LASSO, depending on whether the level of sharing among tasks
is high or low, respectively. The "dirty model" approach [6] develops a method, based on splitting $\beta^*$ into two
matrices which are regularized differently, which shows gains in sample complexity for all levels of sharing.
We review the related existing work in Section 1.1.

Our Contribution: We provide a novel forward-backward greedy algorithm, designed for when the
target structure is a combination of a sparse and block-sparse matrix. We provide theoretical guarantees on
the performance of the algorithm in terms of both estimation error and support recovery. Our analysis is
more subtle than [7], since we would like to have local assumptions on each task as opposed to having
global assumptions on the whole matrix $\beta^*$. Ours is the first attempt to extend greedy approaches, which
are sometimes seen to be both statistically and computationally more efficient than convex programming,
to high-dimensional problems where the best, or only, approach involves the use of more than one structural
model (sparse and group-sparse in our case).

1.1 Related Work 

There is now a huge literature on sparse recovery from linear measurements; we restrict ourselves here to the 
most directly related work on multiple sparse linear regression. 

Convex optimization approaches: A popular recent approach to leverage sharing in sparsity patterns
has been via the use of $\ell_1/\ell_q$ group norms as regularizers, with $q > 1$; examples include the $\ell_1/\ell_\infty$ norm
[16, 18, 9], and the $\ell_1/\ell_2$ norm [8, 10]. The sample complexity of these methods may be sensitive [9] to the
level of sharing of supports, motivating the "dirty model" approach [6]; in that paper, the unknown matrix
was split as the sum of two matrices, regularized to encourage group-sparsity in one and sparsity in the
other. Conceptually, this is similar to our line of thinking; however, their approach was based on convex
optimization. We show that our method empirically has lower sample complexity than [6] (although we do
not have a theoretical characterization of the constant multiplicative factor that seems to be the difference).

Greedy methods: Several algorithms attempt to find the support (and hence values) of sparse vectors 
by iteratively adding, and possibly dropping, elements from the support. The earliest examples were simple 
"forward" algorithms like Orthogonal Matching Pursuit (OMP) [15, 19], etc.; these add elements to the
support until the loss goes below a threshold. More recently, it has been shown [20, 7] that adding a
backward step is more statistically efficient, requiring weaker conditions for support recovery. Another line
of (forward) greedy algorithms works by looking at the gradient of the loss function, instead of the function
itself; see e.g. [5]. A big difference between our work and these is that our forward-backward algorithm works
with two different classes of objects simultaneously: singleton elements of the matrix of vectors that need to 
be recovered, and entire rows of this matrix. This adds a significant extra dimension in algorithm design, as 
we need a way to compare the gains provided by each class of object in a way that ensures convergence and 
correctness. 

2 Our Algorithm 

We now first briefly describe the algorithm in words, and then specify it precisely. A natural loss function 
for our multi-task problem is 




Let $\beta = [\beta^{(1)} \ldots \beta^{(r)}]$ be the $p \times r$ matrix which has the j-th target vector $\beta^{(j)}$ as its j-th column. Our algorithm
is based on iteratively building and modifying the estimated support of $\beta$, by adding (in the forward step)
and removing (in the backward step) two kinds of objects: singleton elements (i, j), and entire rows m.
The basic idea is to include in forward steps singletons/rows that give big decreases in the loss, and to 
remove in the backward steps those whose removal results in only a small increase in the loss. However, the 
kinds of objects cannot be compared in an "apples to apples" way, which means that doing the forward and 
backward steps in a way that ensures convergence and correctness is not immediate; as we will see below 
there are some intricacies in how the addition and removal decisions are made. 

It is easiest to understand our algorithm in terms of "reward" for forward steps, and "cost" for backward 
steps. Each inclusion results in a decrease in the loss; the corresponding reward is an appropriate weighting of 
this decrease, with the weighting tilted to favor singleton elements over entire rows. Similarly, each removal 
results in an increase in the loss; the corresponding cost is the same weighting of this increase. Each iteration
consists of one forward step, and (potentially) several backward steps. We maintain two sets: $\hat{\Omega}_s \subseteq [p] \times [r]$
of singleton elements, and $\hat{\Omega}_b \subseteq [p]$ of rows.$^1$

1. In the forward step, we find the new object whose inclusion would yield the highest reward. If this 
reward is large enough, the object is included in its corresponding matrix (i.e. S or B). We also record 
the value of the reward, and the type of object. If this reward is less than an absolute threshold, the 
algorithm terminates. 

2. In each backward step, we find the object with the lowest cost. If this cost is "low enough", we remove 
the object. Else we do not. We now explain what "low enough" means; this is crucial. Say the object 
with lowest cost is a singleton element, and there are currently k singleton elements in the matrix S. 
Then we remove this element if its cost is smaller, by a fixed fraction v < 1, than the reward obtained 
when the k th singleton element was added to S (note that, because each iteration has several backward 
steps, this addition could have happened many forward steps prior to the current iteration). Similarly, 
if the object was a row, its cost is compared to the corresponding row reward obtained. 

Convergence: It can be seen that in any given iteration, the loss can actually increase! This is because 
there can be multiple backward steps in the same iteration. To see that the algorithm converges, note that 
the cost of each backward step is at most the fraction v < 1 of the reward of the corresponding forward step: 
the one it was compared to when we made the decision to execute the backward step. Thus, this backward 
and forward step, as a pair, result in a decrease in the loss. Convergence follows from the fact that there is 
a one to one correspondence between each backward step and its corresponding forward step; there are no 
backward steps that are "un-accounted for" . 
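The procedure described in this section can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation. It assumes a squared-error loss summed over tasks with a 1/(2 n_j) normalization (an assumption), and it compares each backward cost against the most recently recorded forward reward, as in the algorithm listing. The names `loss`, `refit`, `support_mask`, `greedy_dirty` and the `max_iter` guard are hypothetical.

```python
import numpy as np

def loss(beta, Xs, ys):
    """Assumed squared-error loss summed over tasks (normalization is an assumption)."""
    return sum(np.sum((y - X @ beta[:, j]) ** 2) / (2 * len(y))
               for j, (X, y) in enumerate(zip(Xs, ys)))

def support_mask(S, B, p, r):
    """Boolean p x r mask of the singletons in S together with the rows in B."""
    mask = np.zeros((p, r), dtype=bool)
    for i, j in S:
        mask[i, j] = True
    for m in B:
        mask[m, :] = True
    return mask

def refit(mask, Xs, ys):
    """Least squares for each task, restricted to the support in the mask."""
    beta = np.zeros(mask.shape)
    for j, (X, y) in enumerate(zip(Xs, ys)):
        idx = np.flatnonzero(mask[:, j])
        if idx.size:
            beta[idx, j] = np.linalg.lstsq(X[:, idx], y, rcond=None)[0]
    return beta

def greedy_dirty(Xs, ys, eps, w, nu, max_iter=1000):
    p, r = Xs[0].shape[1], len(Xs)
    S, B, rewards = set(), set(), []   # singletons, rows, stack of forward rewards
    beta = np.zeros((p, r))
    for _ in range(max_iter):          # guard against non-termination (assumption)
        # Forward step: closed-form one-coordinate gains for the squared loss.
        gain = np.zeros((p, r))
        for j, (X, y) in enumerate(zip(Xs, ys)):
            res = y - X @ beta[:, j]
            gain[:, j] = (X.T @ res) ** 2 / (2 * len(y) * np.sum(X ** 2, axis=0))
        single = np.where(support_mask(S, B, p, r), 0.0, gain)
        row = gain.sum(axis=1) / w     # row reward is down-weighted by w
        for m in B:
            row[m] = 0.0
        i, j = np.unravel_index(np.argmax(single), single.shape)
        m = int(np.argmax(row))
        mu = max(single[i, j], row[m])
        if mu < eps or mu == 0.0:
            break
        if row[m] > single[i, j]:
            B.add(m)
        else:
            S.add((int(i), int(j)))
        rewards.append(mu)
        beta = refit(support_mask(S, B, p, r), Xs, ys)
        # Backward steps: remove objects whose cost is at most nu times the last reward.
        while rewards and (S or B):
            base = loss(beta, Xs, ys)
            costs = []
            for a, b in S:
                trial = beta.copy()
                trial[a, b] = 0.0
                costs.append((loss(trial, Xs, ys) - base, "s", (a, b)))
            for a in B:
                trial = beta.copy()
                trial[a, :] = 0.0
                costs.append(((loss(trial, Xs, ys) - base) / w, "b", a))
            cost, kind, obj = min(costs, key=lambda c: c[0])
            if cost > nu * rewards[-1]:
                break
            (S if kind == "s" else B).remove(obj)
            rewards.pop()
            beta = refit(support_mask(S, B, p, r), Xs, ys)
    return beta, S, B
```

Because the estimate is always re-fit on the current support, the returned matrix is zero outside the returned sets, and its loss never exceeds that of the all-zero matrix.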



3 Performance Guarantees 

Shared and non-shared features: Consider the true matrix $\beta^*$, and for a fixed integer d define the set
of "shared" features/rows $\Omega_b^* := \{i \in [p] : |\mathrm{supp}(\beta_i^*)| \ge d\}$, i.e. the rows whose support has size at least d. In this paper, we
overload notation so that $\Omega_b^*$ refers to both the set of rows above (in which case $\Omega_b^* \subseteq [p]$) and the set of all
elements in these rows (in which case $\Omega_b^* \subseteq [p] \times [r]$); the correct interpretation is always clear from context. We
can also define the support on the non-shared features $\Omega_s^* \subseteq [p] \times [r]$ as follows: $\Omega_s^* := \{(i, j) \in [p] \times [r] : \beta^*_{ij} \ne 0
\text{ and } i \notin \Omega_b^*\}$, and finally, we define $s_j^* = |\Omega_b^*| + |\{i : (i, j) \in \Omega_s^*\}|$.

Recall that our method requires a number $w \in (1, r)$ as an input, and outputs two sets (a set $\hat{\Omega}_s \subseteq [p] \times [r]$
of singleton elements, and a set $\hat{\Omega}_b \subseteq [p]$ of rows) and an estimated matrix $\hat{\beta}$ which is supported on $\hat{\Omega}_s \cup \hat{\Omega}_b$.
Our main analytical result, Theorem 1 below, is a deterministic quantification of conditions (on $X, z, \beta^*$)
under which our algorithm, with $w \in (d - 1, d)$ as input, yields sparsistency, i.e. recovery of the shared rows
$\hat{\Omega}_b = \Omega_b^*$ and of the support on the non-shared rows $\hat{\Omega}_s = \Omega_s^*$, together with small error $\|\hat{\beta} - \beta^*\|_F$. We start with the
assumptions and then state the theorem. Corollary 1 covers two popular scenarios with randomness: where
the design matrices X are deterministic but the noise vectors z are Gaussian, and the case where both X
and z are Gaussian.

Restricted Eigenvalue Property (REP): Fix a j, and sparsity level $s_j$. We say the matrix $Q^{(j)} :=
X^{(j)T} X^{(j)}$ satisfies REP($s_j$) with constants $C_{\min}$ and $\rho \ge 1$, if for all $s_j$-sparse vectors $\delta \in \mathbb{R}^p$, we
have that

$$C_{\min} \|\delta\|_2 \le \|Q^{(j)} \delta\|_2 \le \rho\, C_{\min} \|\delta\|_2 \qquad \forall\, \|\delta\|_0 \le s_j. \tag{2}$$

In our results, we assume (by taking the maximum/minimum over all tasks) that $C_{\min}$ and $\rho$ are the same
for all tasks. Note that the level of sparsity $s_j$ will still be different for different j.
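For very small problems, the constants in the REP condition can be checked by brute force, which may help build intuition. The sketch below takes a matrix `Q` as given (its construction and normalization are left to the caller, which is an assumption) and enumerates all supports of size s; the function name `rep_constants` is hypothetical. The cost grows combinatorially in p, so this is only an illustration.

```python
import itertools
import numpy as np

def rep_constants(Q, s):
    """Return (C_min, rho) such that C_min*||d|| <= ||Q d|| <= rho*C_min*||d||
    for every s-sparse d, found by enumerating all supports of size s."""
    lo, hi = np.inf, 0.0
    for T in itertools.combinations(range(Q.shape[1]), s):
        # For d supported on T, ||Q d|| = ||Q[:, T] d_T||, so the extreme ratios
        # are the smallest and largest singular values of the column submatrix.
        sv = np.linalg.svd(Q[:, list(T)], compute_uv=False)
        lo, hi = min(lo, sv[-1]), max(hi, sv[0])
    return lo, hi / lo   # rho is undefined (division by zero) if C_min is 0
```

Supports smaller than s need no separate check, since every such vector is also supported on some set of size exactly s.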

Gradient of the loss function: If there is no noise, i.e. $z^{(j)} = 0$ for all j, then $\beta^*$ is the optimal point
of the loss function and the loss function has zero gradient there, i.e. $\nabla \mathcal{L}(\beta^*) = 0$. However, for any j, if
$z^{(j)} \ne 0$, the corresponding gradient $\nabla^{(j)} := -X^{(j)T} (y^{(j)} - X^{(j)} \beta^{*(j)})$ will not be zero. We define $\lambda$
to be an upper bound on the infinity norm of this gradient, i.e. $\lambda := \max_j \|\nabla^{(j)}\|_\infty$.
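As a small numerical illustration of this quantity, the sketch below evaluates the largest absolute gradient entry over all tasks at a given parameter matrix, using the gradient expression stated above without any extra normalization. The function name `gradient_bound` is hypothetical.

```python
import numpy as np

def gradient_bound(Xs, ys, beta_star):
    """max_j || X_j^T (y_j - X_j beta_star[:, j]) ||_inf over all tasks j."""
    return max(np.max(np.abs(X.T @ (y - X @ beta_star[:, j])))
               for j, (X, y) in enumerate(zip(Xs, ys)))
```

In the noiseless case the residual vanishes at the true parameters, so the returned value is zero; with noise it equals the largest absolute entry of $X^{(j)T} z^{(j)}$ over tasks.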

Minimum non-zero element: Elements of $\beta^*$ with very small magnitude are hard to distinguish from
0, so we need to specify a lower bound on the magnitude of elements in the support of $\beta^*$ that we want to recover.
Towards this end, for a given d, suppose $\beta^*_{m,d}$ is the magnitude of the d-th largest entry (by magnitude)
in row m of $\beta^*$, and consider its minimum over the shared rows, $\min_{m \in \Omega_b^*} \beta^*_{m,d}$. Finally, let $\beta^*_{\min} = \min\{ \min_{(i,j) \in \Omega_s^*} |\beta^*_{ij}|,\ \min_{m \in \Omega_b^*} \beta^*_{m,d} \}$. Note

$^1$ We abuse notation by using $\hat{\Omega}_b$ to also refer to all elements in these rows. The correct usage is always clear from context.
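Algorithm 1 Greedy Dirty Model

Input: Data $\{y^{(1)}, X^{(1)}, \ldots, y^{(r)}, X^{(r)}\}$, Stopping Threshold $\epsilon$, Sharing threshold weight $w \in (1, r)$, Backward Factor $\nu \in (0, 1)$

Output Variables: set of singleton elements $\hat{\Omega}_s \subseteq [p] \times [r]$, set of rows $\hat{\Omega}_b \subseteq [p]$

Initialize: $\hat{\Omega}_s = \emptyset$, $\hat{\Omega}_b = \emptyset$, $\hat{\beta} = 0$, $k = 1$
while true do {Forward Step}
    Find the best new singleton element and its reward
        $[\mu_s, (\hat{i}, \hat{j})] \leftarrow \max_{(i,j) \notin \hat{\Omega}_s} \{ \mathcal{L}(\hat{\beta}) - \min_{\gamma \in \mathbb{R}} \mathcal{L}(\hat{\beta} + \gamma e_i e_j^T) \}$
    Find the best new row and its reward
        $[\mu_b, \hat{m}] \leftarrow \frac{1}{w} \times \max_{m \notin \hat{\Omega}_b} \{ \mathcal{L}(\hat{\beta}) - \min_{\alpha \in \mathbb{R}^r} \mathcal{L}(\hat{\beta} + e_m \alpha^T) \}$
    Choose and record the bigger weighted gain $\mu^{(k)} \leftarrow \max(\mu_s, \mu_b)$
    if $\mu^{(k)} < \epsilon$ then {Gain too small}
        break (algorithm stops)
    end if
    If $\mu_b > \mu_s$ then add row $\hat{\Omega}_b \leftarrow \hat{\Omega}_b \cup \{\hat{m}\}$, else add singleton $\hat{\Omega}_s \leftarrow \hat{\Omega}_s \cup \{(\hat{i}, \hat{j})\}$
    Re-estimate on the new support set
        $\hat{\beta} \leftarrow \arg\min_{\beta :\, \mathrm{supp}(\beta) \subseteq \hat{\Omega}_s \cup \hat{\Omega}_b} \mathcal{L}(\beta)$
    Increment $k \leftarrow k + 1$
    while true do {Several backward steps for each forward step}
        Find the worst singleton element and its cost
            $[\nu_s, (\hat{i}, \hat{j})] \leftarrow \min_{(i,j) \in \hat{\Omega}_s} \{ \mathcal{L}(\hat{\beta} - \hat{\beta}_{ij} e_i e_j^T) - \mathcal{L}(\hat{\beta}) \}$
        Find the worst row and its cost
            $[\nu_b, \hat{m}] \leftarrow \frac{1}{w} \times \min_{m \in \hat{\Omega}_b} \{ \mathcal{L}(\hat{\beta} - e_m \hat{\beta}_m) - \mathcal{L}(\hat{\beta}) \}$
        if $\min(\nu_s, \nu_b) > \nu \mu^{(k-1)}$ then {Cost too large}
            break (backward steps end)
        end if
        If $\nu_b < \nu_s$ then remove row $\hat{\Omega}_b \leftarrow \hat{\Omega}_b - \{\hat{m}\}$, else remove singleton $\hat{\Omega}_s \leftarrow \hat{\Omega}_s - \{(\hat{i}, \hat{j})\}$
        Re-estimate on the new support set
            $\hat{\beta} \leftarrow \arg\min_{\beta :\, \mathrm{supp}(\beta) \subseteq \hat{\Omega}_s \cup \hat{\Omega}_b} \mathcal{L}(\beta)$
        Decrement $k \leftarrow k - 1$
    end while
end while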
that a lower bound on $\beta^*_{\min}$ implies that there are at least d elements whose magnitude is above that bound in
every row of $\Omega_b^*$, and also that every element in $\Omega_s^*$ is above that bound.

Theorem 1 (Sparsistency). Suppose the algorithm is run with $\epsilon$, w and $\nu$. Let d be such that $d - 1 < w < d$,
and for this d let shared rows $\Omega_b^*$, non-shared features $\Omega_s^*$, sparsity levels $\{s_j^*\}$, and minimum element $\beta^*_{\min}$ be
as above. Suppose also that $\beta^*_{\min} > 4\rho\sqrt{\epsilon}/C_{\min}$; and for each j we have that REP($\eta s_j^*$) holds with constants
$C_{\min}$ and $\rho$, and that $\eta \ge 2 + 4r\rho^4(\rho^4 - \rho^2 + 2)/(w\nu)$. Then, if we run Algorithm 1 with stopping threshold

a 2 2 * a 2 ^ ^ ^ — 

e > p — ; output (3 with shared support and individual support Q s satisfies: 

Wl/ min 

(a) Error Bound: \\P - P*\\ F < ^£ (e£ + W*) ■ 
(b) Support Recovery: $\hat{\Omega}_b = \Omega_b^*$ and $\hat{\Omega}_s = \Omega_s^*$.



Remark 1. The noiseless case $z = 0$ corresponds to $\lambda = 0$, in which case the algorithm can be run with
$\epsilon = 0$. As can be seen, this yields exact recovery, i.e. $\hat{\beta} = \beta^*$.

Remark 2. The smaller the value of the backward factor $\nu$, the faster the algorithm is likely to converge,
as there are likely to be fewer backward steps. However, smaller $\nu$ results in larger values of $\eta$ and $\beta^*_{\min}$ that
we need for success; thus an algorithm with smaller $\nu$ is likely to work on a smaller range of problems: a
trade-off between statistical and computational complexity.

Remark 3. Note that all the rows in $\hat{\Omega}_s$ have fewer than d elements. To see this, suppose to the contrary that
there exists a row m in $\hat{\Omega}_s$ that has d or more non-zeros. Since in the algorithm these single
elements compete with $1/w$ times the improvement of the row, and $d - 1 < w < d$, the row m will be
chosen for $\hat{\Omega}_b$ before those d entries are chosen for $\hat{\Omega}_s$. Once the row m goes to $\hat{\Omega}_b$, since we optimize for each
entry on the row separately, it is impossible that any other single element on that row goes to $\hat{\Omega}_s$. Hence,
rows of $\hat{\Omega}_s$ have fewer than d entries and can be distinguished from the rows of $\hat{\Omega}_b$.

Remark 4. Some recent results [3, 14] study greedy algorithms in a general "atomic" framework. While
our setting could be made to fall into this general framework, the resulting algorithm would be different, and
the performance guarantees would be weaker. These results require REP($\eta \sum_j s_j^*$) for each and every task,
which is order-wise (by an order of r) worse than our assumption REP($\eta s_j^*$) for task j. To get this result,
we leverage the fact that our loss function is separable with respect to tasks and hence, we do the analysis
on a per-task basis.

Corollary 1. For sample complexity $n_j \ge c_1 s_j \log(rp)$, with probability at least $1 - c_2 \exp(-c_3 n)$ for some
constants $c_1, \ldots, c_4$, we have



(CI) Under the assumptions of the Theorem^ if z is A/"(0,cr) ; then the result holds for A = c^y °s{rp) j QT 
some constant C4. 



(C2) If is A/"(0,S^')) and REP assumption in Theorem^ holds for = E^EW) t , then the result 
holds. 

Proof of Theorem 1. Let $\bar{s}_j = |\hat{\Omega}_s^{(j)} \cup \hat{\Omega}_b \cup \Omega_s^{*(j)} \cup \Omega_b^*|$ be the size of the support of the estimated j-th task union
with the support of the true j-th task. Inspired by [7], our proof is based upon the following two lemmas:

Lemma 1. If REP(sj) holds, then 

(1) ||^ - || 2 < ct~ + 1p\fmt U n*s U) ) - (Q b U m j) )ie) . 



(ii) 



(Qi j) UQ 5 )-(Q* (j) UQ*) 



Lemma 2. If $\epsilon$ is chosen properly (see appendix for the exact expression), then k never exceeds $(\eta - 1)s_j^*$,
and hence, $\bar{s}_j \le k + s_j^* \le \eta s_j^*$.

Parts (i) and (ii) of Lemma 1 are consequences of the fact that when the algorithm stops, the forward step
and the previous backward step fail to go through, respectively. To ensure that the assumption of Lemma 1 holds,
we need Lemma 2, which bounds $\bar{s}_j$. The proof can be completed as below.
(a) The result follows directly from part (i) of Lemma 1, noting that $\bar{s}_j \le \eta s_j^*$ by Lemma 2.

(b) Considering Remark 3, we only need to show that (O^UOj (j) )-(0 6 UOi j) ) = (0 6 UOi j) )-(O^UOj (j) ) = 
0. For any r G M, we have 

(^in) 2 |(fi* u - (n b u = (/?; in ) 2 e (n; u o:«) - (o, u o«) : > /?; in }| 

^^(Vn: ')-^^) 111 - 11 ^-^ 111 



mm mm 



where the last inequality follows from part (a) and the inequality (a + b) 2 < 2a 2 + 2b 2 . Now, dividing both 
sides by /3£ 2 in /2 we get 

2\(Qt u q:^) - (Q b u < ^ g ? 2 + -^1— |(n b * u n*™) - (n b u n?>)| 

^minvPmin/ ^minvPmin/ 

< i + |(o:uo: (i) )-(o b uo^)|. 

The inequality follows from the assumption on e and /3 min implying U Qs^) — (fib U = 0. To 

show the converse, from part (ii) of Lemma [I] we have 

u fffl) - (nt u < r ^ c ™ in n^i) n^ rp 2 ^ ||^ ( i)_ r (i)|| 2 

lv y v y| - we M ^(fi b ufi< j) )-(n*ufi: (j) ) M ~ we II ll 2 

/ ^P 2 ^min 2r]rs*\ 2 
~ we " 1/2 

due to the setting of the stopping threshold e. This implies that |(f^ U f^) — U fis^)| = and 
concludes the proof of the theorem. □ 



4 Experimental Results 
4.1 Synthetic Data 

To have a common ground for comparison, we run the same experiment used for the comparison of LASSO,
group LASSO and dirty model in [9, 6]. Consider the case where we have r = 2 tasks, each with a support
of size s = p/10, and suppose these two tasks share a $\kappa$ portion of their supports. The locations of the
non-zero entries are chosen uniformly at random and the values of $\beta^{*(1)}$ and $\beta^{*(2)}$ are chosen to be standard Gaussian
realizations. Each row of the matrices $X^{(1)}$ and $X^{(2)}$ is distributed as $\mathcal{N}(0, I)$ and each entry of the noise
vectors $w_1$ and $w_2$ is a zero-mean Gaussian draw with variance 0.1. We run the experiment for problem sizes
$p \in \{128, 256, 512\}$ and for support overlap levels $\kappa \in \{0.3, 2/3, 0.8\}$.
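The synthetic setup above can be generated along the following lines. This is a sketch under stated assumptions: the number of shared rows is taken as $\kappa s$ rounded to an integer, the non-zero values are drawn independently for each task, and the function name `make_two_task_problem`, its parameters and the seeding are hypothetical.

```python
import numpy as np

def make_two_task_problem(p, n, kappa, noise_var=0.1, seed=0):
    """Two tasks, support size s = p // 10 each, sharing round(kappa * s) rows."""
    rng = np.random.default_rng(seed)
    s = p // 10
    shared = int(round(kappa * s))
    perm = rng.permutation(p)             # uniformly random support locations
    common = perm[:shared]
    own = [perm[shared:s], perm[s:2 * s - shared]]
    beta = np.zeros((p, 2))
    for j in range(2):
        idx = np.concatenate([common, own[j]])
        beta[idx, j] = rng.standard_normal(s)    # standard Gaussian values
    Xs = [rng.standard_normal((n, p)) for _ in range(2)]   # rows ~ N(0, I)
    ys = [X @ beta[:, j] + np.sqrt(noise_var) * rng.standard_normal(n)
          for j, X in enumerate(Xs)]
    return Xs, ys, beta
```

For example, `make_two_task_problem(128, 60, 2/3)` yields supports of size 12 per task with 8 shared rows.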

We use cross-validation to find the best values of the regularizer coefficients. To do so, we choose $\epsilon = c\, s \log(p)/n$,
where $c \in [10^{-4}, 10]$, and $w \in [1, 2]$. Notice that this search region is motivated by the requirements of our
theorem and can be substantially smaller than the region that would need to be searched for $\epsilon$ and w if they were
independent. Interestingly, for a small number of samples n, the ratio w tends to be close to 1, whereas for
a large number of samples, the ratio tends to be close to 2. We suspect this phenomenon is due to the lack
of curvature around the optimal point when we have few samples. The greedy algorithm is more stable if it
picks a row as opposed to a single coordinate, even if the improvement of the entire row is comparable to the
improvement of a single coordinate.

To compare different methods under this regime, we define a rescaled version of the sample size n, aka the control
parameter $\theta = \frac{n}{s \log(p - (2 - \kappa)s)}$. For different values of $\kappa$, we plot the probability of success, obtained by
averaging over 100 problems, versus the control parameter $\theta$ in Fig. 1. It can be seen that the greedy method
outperforms the other methods, i.e., requires fewer samples to recover the exact sign support of $\beta^*$.

This result matches the known theoretical guarantees. It is well-known that LASSO has a sharp transition
at $\theta \approx 2$ [17], group LASSO ($\ell_1/\ell_\infty$ regularizer) has a sharp transition at $\theta = 4 - 3\kappa$ [9] and dirty model
has a sharp transition at $\theta = 2 - \kappa$ [6]. Although we do not have a theoretical result, these experiments
suggest the following conjecture:

1 The exact expression is - i Q g( p ) — 2. Here, we ignore the term (2 — k)s comparing to p. 
[Figure 1, three panels, each plotting the probability of success against the control parameter $\theta$ with s = 0.1p:
(a) Little overlap: $\kappa = 0.3$; (b) Moderate overlap: $\kappa = 2/3$; (c) High overlap: $\kappa = 0.8$.]

Figure 1: Probability of success in recovering the exact sign support using greedy algorithm, dirty model, LASSO and
group LASSO ($\ell_1/\ell_\infty$). For a 2-task problem, the probability of success for different values of feature-overlap fraction
$\kappa$ is plotted. Here, we let s = p/10 and the values of the parameter and design matrices are i.i.d. standard Gaussians
and $\sigma = 0.1$. Greedy method outperforms all methods in the sample complexity required for sign support recovery.

Conjecture 1. For a two-task problem with $C_{\min} = \rho = 1$ and Gaussian designs, the greedy algorithm has a
sharp transition at $\theta = 1 - \kappa/2$.

To investigate our conjecture, we plot the sharp transition thresholds for different methods versus different
values of $\kappa \in \{0.05, 0.3, 2/3, 0.8, 0.95\}$ for problem sizes $p \in \{128, 256, 512\}$. Fig. 2 shows that the sharp
transition threshold for the greedy algorithm follows our conjecture with good precision. A theoretical
guarantee for such a tight threshold remains open.
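To make the rescaling concrete, the following sketch converts between the sample size n and the control parameter, and tabulates the transition thresholds reported in the literature for the three baseline methods discussed in this section. The names `control_parameter`, `REPORTED_THRESHOLDS` and `samples_at_threshold` are hypothetical, and the conjectured greedy threshold is deliberately left out of the table.

```python
import math

def control_parameter(n, p, s, kappa):
    """Rescaled sample size: n / (s * log(p - (2 - kappa) * s))."""
    return n / (s * math.log(p - (2 - kappa) * s))

# Sharp-transition thresholds quoted in the text for the baseline methods.
REPORTED_THRESHOLDS = {
    "lasso": lambda kappa: 2.0,
    "group_lasso": lambda kappa: 4.0 - 3.0 * kappa,
    "dirty_model": lambda kappa: 2.0 - kappa,
}

def samples_at_threshold(method, p, s, kappa):
    """Sample size n at which the given method reaches its reported threshold."""
    theta = REPORTED_THRESHOLDS[method](kappa)
    return theta * s * math.log(p - (2 - kappa) * s)
```

Note that at $\kappa = 2/3$ the group LASSO and LASSO thresholds coincide, which is consistent with group LASSO being better than LASSO only for higher levels of sharing.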

4.2 Handwritten Digits Dataset 

We use the handwritten digit dataset [3] that is used by a number of papers [11, 4, 6] as a reliable dataset
for optical handwritten digit recognition algorithms. The dataset contains p = 649 features of handwritten
numerals 0-9 (r = 10 tasks) extracted from a collection of Dutch utility maps. The dataset provides 200
samples of each digit written by different people. We take n/10 samples from each digit and combine them
into a big matrix $X \in \mathbb{R}^{n \times p}$, i.e., we set $X^{(i)} = X$ for all $i \in \{1, \ldots, 10\}$. We construct the response vectors $y_i$
to be 1 if the corresponding row in X is an instance of the i-th digit and zero otherwise. Clearly, the $y_i$'s will have
disjoint support sets. We run all four algorithms on this data and report the results.
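The construction of the response vectors described above amounts to a one-versus-rest encoding of the digit labels. A minimal sketch, assuming the labels are given as integers 0 to 9 (the function name `one_vs_rest_targets` is hypothetical):

```python
import numpy as np

def one_vs_rest_targets(labels, n_classes=10):
    """Return one 0/1 response vector per class; entry is 1 where the label matches."""
    labels = np.asarray(labels)
    return [(labels == c).astype(float) for c in range(n_classes)]
```

Since every sample has exactly one label, the supports of the resulting vectors are disjoint and the vectors sum to the all-ones vector.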

Table 1 shows the results of our analysis for different sizes of the training set n. We measure the
classification error for each digit to get the 10-vector of errors. Then, we find the average error and the
variance of the error vector to show how the error is distributed over all tasks. Again, in all methods,
parameters are chosen via cross-validation. It can be seen that the greedy method provides a more consistent
model selection, as the model complexity does not change much as the number of samples increases, while
the classification error decreases substantially. In all cases, we get a 25% to 30% improvement in classification
error.
[Figure 2: plot of the phase transition threshold (y-axis) against the shared support parameter $\kappa$ (x-axis, from 0.1 to 1).]

Figure 2: Phase transition threshold versus the parameter $\kappa$ in a 2-task problem for greedy algorithm, dirty model,
LASSO and group LASSO ($\ell_1/\ell_\infty$ regularizer). The y-axis is $\theta = \frac{n}{s \log(p - (2 - \kappa)s)}$. Here, we let s = p/10 and the
values of the parameter and design matrices are i.i.d. standard Gaussians and $\sigma = 0.1$. The greedy algorithm shows
substantial improvement in sample complexity over the other methods.



 n    Metric                          Greedy    Dirty Model    Group LASSO    LASSO

 10   Average Classification Error    6.5%      8.6%           9.9%           10.8%
      Variance of Error               0.4%      0.53%          0.64%          0.51%
      Average Row Support Size        180       171            170            123
      Average Support Size            1072      1651           1700           539

 20   Average Classification Error    2.1%      3.0%           3.5%           4.1%
      Variance of Error               0.44%     0.56%          0.62%          0.68%
      Average Row Support Size        185       226            217            173
      Average Support Size            1120      2118           2165           821

 40   Average Classification Error    1.4%      2.2%           3.2%           2.8%
      Variance of Error               0.48%     0.57%          0.68%          0.85%
      Average Row Support Size        194       299            368            354
      Average Support Size            1432      2761           3669           2053

Table 1: Handwriting Classification Results for greedy algorithm, dirty model, group LASSO and LASSO. The
greedy method provides much better classification errors with simpler models. The greedy model selection is more
consistent as the number of samples increases.

References 

[1] F. Bach. Consistency of the group lasso and multiple kernel learning. Journal of Machine Learning 
Research, 9:1179-1225, 2008. 

[2] R. Caruana. Multitask learning. Machine Learning, 28:41-75, 1997. 

[3] R. P.W. Duin. Department of Applied Physics, Delft University of Technology, Delft, The Netherlands, 
2002. 

[4] X. He and P. Niyogi. Locality preserving projections. In NIPS, 2003. 

[5] P. Jain, A. Tewari, and I.S. Dhillon. Orthogonal matching pursuit with replacement. NIPS, 2011. 

[6] A. Jalali, P. Ravikumar, S. Sanghavi, and C. Ruan. A dirty model for multi-task learning. In NIPS, 
2010. 

[7] A. Jalali, C. Johnson, and P. Ravikumar. On learning discrete graphical models using greedy methods. 
In NIPS, 2011. 

[8] K. Lounici, A. B. Tsybakov, M. Pontil, and S. A. van de Geer. Taking advantage of sparsity in multi-task 
learning. In 22nd Conference On Learning Theory (COLT), 2009. 

[9] S. Negahban and M. J. Wainwright. Joint support recovery under high-dimensional scaling: Benefits
and perils of $\ell_{1,\infty}$-regularization. In Advances in Neural Information Processing Systems (NIPS), 2008.






[10] G. Obozinski, M. J. Wainwright, and M. I. Jordan. Support union recovery in high-dimensional
multivariate regression. Annals of Statistics, 39:1-17, 2011.

[11] S. Perkins and J. Theiler. Online feature selection using grafting. In ICML, 2003. 

[12] P. Ravikumar, H. Liu, J. Lafferty, and L. Wasserman. Sparse additive models. Journal of the Royal 
Statistical Society, Series B, . 

[13] P. Ravikumar, M. J. Wainwright, and J. Lafferty. High-dimensional Ising model selection using
$\ell_1$-regularized logistic regression. Annals of Statistics, 38(3):1287-1319, .

[14] A. Tewari, P. Ravikumar, and I.S. Dhillon. Greedy algorithms for structurally constrained high
dimensional problems. NIPS, 2011.

[15] J. A. Tropp and A.C. Gilbert. Signal recovery from random measurements via orthogonal matching 
pursuit. IEEE Transaction on Information Theory, 53:4655-4666, 2007. 

[16] B. Turlach, W.N. Venables, and S.J. Wright. Simultaneous variable selection. Technometrics, 27:
349-363, 2005.

[17] M. J. Wainwright. Sharp thresholds for noisy and high-dimensional recovery of sparsity using
$\ell_1$-constrained quadratic programming (lasso). IEEE Transactions on Information Theory, 55:2183-2202,
2009.

[18] C. Zhang and J. Huang. Model selection consistency of the lasso selection in high-dimensional linear 
regression. Annals of Statistics, 36:1567-1594, 2008. 

[19] T. Zhang. Sparse recovery with orthogonal matching pursuit under RIP. IEEE Transactions on
Information Theory, 57.

[20] T. Zhang. Adaptive forward-backward greedy algorithm for sparse learning with linear models. In 
Neural Information Processing Systems (NIPS) 21, 2008. 



A Auxiliary Lemmas for Theorem 1 

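Note that when the algorithm terminates, the forward step fails to go through. This entails that

$$\mathcal{L}(\hat{\beta}) - \inf_{m \notin \hat{\Omega}_b,\, \alpha \in \mathbb{R}^r} \mathcal{L}(\hat{\beta} + e_m \alpha^T) < w\epsilon,$$

$$\mathcal{L}(\hat{\beta}) - \inf_{(i,j) \notin \hat{\Omega}_s,\, \gamma \in \mathbb{R}} \mathcal{L}(\hat{\beta} + \gamma e_i e_j^T) < \epsilon.$$

Since our loss function is separable with respect to tasks, i.e., $\mathcal{L}(\beta) = \sum_j \mathcal{L}(\beta^{(j)} e_j^T)$, for a fixed task j, we
can rewrite the second inequality as

$$\mathcal{L}(\hat{\beta}^{(j)} e_j^T) - \inf_{(i,j) \notin \hat{\Omega}_s,\, \gamma \in \mathbb{R}} \mathcal{L}(\hat{\beta}^{(j)} e_j^T + \gamma e_i e_j^T) < \epsilon.$$

The next lemma shows that this has the consequence of upper bounding the deviation in loss between
the estimated parameters $\hat{\beta}$ and the true parameters $\beta^*$.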

Lemma 3 (Stopping Forward Step). When the algorithm stops with parameter j3, we have 

C (>')ej ) - L (> (i) ej) | < 2pC mi ^m: {j) uni) - 0» Uh b )\e \\^ - || 2 . (3) 
Proof. Let $\Delta = \beta^* - \hat{\beta}$. For any $\eta \in \mathbb{R}$, we have

-|(n:«)ufij)-(fiW)ufi 6 )|c 



(4) 



< - £ (j§W) e j) ) + ,yci in ||A«||2. 

Here, we use the fact that $\nabla \mathcal{L}(\hat{\beta}^{(j)} e_j^T)$ is zero on the support of $\hat{\beta}^{(j)}$. Optimizing the RHS over $\eta$, we obtain

f£(/3*W)e7)-£f^)e7)) 2 

-l(fijW) UfiJ) - Uft h )|e < - ^ J iisYm ' 

4p 2 C^ n ||A0)||| 

whence the lemma follows. 



Lemma 4 (Stopping Error Bound). When the algorithm stops with parameter (3, we have 

0i) || 2 < J_ f -A. u fi 6 *) u u + 2 P y/\(sf a (i) u ft 6 *) - u n 6 )|« 



□ 



G m i n \ G m in 

Proo/. For AgF, let 

G(A) = £ (/?* (i) eJ + Aej) - £ (/?*«ej) - 2pC minV /|(^ (i) U Q* b ) - U fi 6 )|e ||A|| 2 . 

It can be seen that G(0) = 0, and from the previous lemma, G(A^) < 0. Further, G(A) is sub-homogeneous 
(over a limited range): G(tA) < tG(A) for t G [0,1] by basic properties of the convex function. Thus, 
for a carefully chosen r > 0, if we show that G(A) > for all A G {A : ||A||2 = r, ||A||o < s}}, where, 
Sj = \(fl*s ij) U ftj) U U fife) | is as defined in the proof of the theorem, then, it follows that || A^ || 2 < r. If 
not, then there would exist some t G [0, 1) such that ||tA^ || = r, whence we would arrive at the contradiction 

0<G(tA ( ^) <£G(A ( ^) <0. 

Thus, it remains to show that G(A) > for all A G {A : ||A||2 = r, ||A||o < s'j}. By restricted strong 
convexity property of £(•), we have 

£(/3* (i) eJ + AeJ) - £(/3*«eJ) = £ (£(/?* + Aej) - £(/?*)) 



>(V£(/r),A) + C in ||A||2. 



We can establish 



(V£(/3*),A)>-|(V£(r),A}| 

^-IIVADIIoollAII^-AIIAH,, 

and hence, 

G(A) > -A|| A|| x + C&JIAII! - 2pC min ^mt U) un* b ) - s j) U n 6 )|e|| A| 



> 



( - a^k^u^u^u^)! + C 2 min \\ A\\ 2 

- 2 P c min \Jm*s U) un* b ) - un 6 )|e) ||A|| 2 > o, 
if $\|\Delta\|_2 = r$ for



1 



A 



Cmin \ Cmin 

This concludes the proof of the lemma. 



J\(si* a U) u n*) u u h b )\ + 2 P ^/\(n* s U) u «*) - u n 6 )|, 



□ 



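Next, we note that when the algorithm terminates, the backward step with the current parameters has
failed to go through. This entails that

$$\inf_{m \in \hat{\Omega}_b} \mathcal{L}(\hat{\beta} - e_m \hat{\beta}_m) - \mathcal{L}(\hat{\beta}) > \nu w \epsilon,$$

$$\inf_{(i,j) \in \hat{\Omega}_s} \mathcal{L}(\hat{\beta} - \hat{\beta}_{ij} e_i e_j^T) - \mathcal{L}(\hat{\beta}) > \nu \epsilon.$$

The next lemma shows the consequence of this bound.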

Lemma 5 (Stopping Backward Step). When the algorithm stops with parameter $\hat\beta$, we have 



(5) 



8 ij) 

^(ni 3) ufii,)-(fi; (3) unj) 



2 r P 2 Cmin 



i(o«uf),)-(fi:«u^)i. 



Proof. We have 
|(0«U0 6 )-(0:WuQ£)|- f 



< 



1^0) _ u ^* u n b )\-ve + -\ft b - u nr)|i/u>e 

r r 

$3 (£(£-^Vj) -£(£)) + J £ 

=.6.( J ')-r.o!( J ')| i.O?i 1.6.^ raef2 b -(ft* (j) Uft*) 



< (v/;(^),^)_ (n; (i) un j uab) ) +p 2 c2 dn ||^ ) _ m#u . ) 



(6) 



+ ( V ^A-(n:«)unj,)VC^||^_ (n:0)un:) || 



P U min 



/3 



(n^ufi b )-(n: (i) uQ*) 



where, the second inequality uses the fact that [V£(/3)b 



0. 



□ 



B Lemmas on the Stopping Size 

Lemma 6. If e > c2 A /or some 77 > 2 + 4rp and REP (r]s*) holds, then the algorithm stops 

with (column) support size Sj < (n — for all j G {1, 2, . . . , r}. 

Proof Consider the first time the algorithm reaches k = (77 — + 1. By Lemmas [8] and |9j we have 



fwv Sj-l-S* 



Sj-1 



< 



\(0\k - 1) U Q b (k - 1)) - U Sl*)\ 

r \\ \(hi j \k - 1) u n b (k - 1)) u (nt U) u 



*(i) 



< 



Cmin \A 



Kfi^fc - 1) u fi 6 (fc - 1)) u u n* b )\ 



+ 2p 2 



m* s (: » u np - (nj^jk - 1) u Q b (k - 1))| 
\ m: U) u n;) u 0J\k - 1) u fi 6 (fc - i))| 



< 



\p , /0 V2(P 2 " 1) 



Cmin\/^ 



2p 2 



Sj +s*-l 

Hence, we get 



2p* < \p 



Cmin \[t 



For 77 > 2 



4rp 4 (p 4 -p 2 +2) 



the LHS is positive and we arrive at a contradiction with the assumption on $\epsilon$. 



□ 



Lemma 7 (General Forward Step). For any $j \in \{1, 2, \ldots, r\}$, the first time the algorithm reaches a (column) 
support size of $s_j$ at the beginning of the forward step, we have 



(j}*Wef)-C(pV\k-l)ej) 



<2pC n 



^J\(n; {j) U fi*) - (W\k - 1) U n b (k - l))\fii k) e - P\k - 1) 



Proof. According to the forward step, we have 

C (p(k - 1)) - ^ inf C (j3(k - 1) + jaej) = 4 „ 

Since the loss function is separable with respect to the columns of $\beta$, for any fixed $j \in \{1, \ldots, r\}$ we have 



C (p\k - l)ej) - inf C (p\k - l)ej + je^ej) < $\ 

Similar to Q, for any $\eta \in \mathbb{R}$, we have 

- u n* b ) - (n«(fc - 1) u fi b (fc - 1))| M «e 

< q (C (/3* (i >ej) - C (p\k - l)ej)) + 
Optimizing the RHS over $\eta$, we obtain 

(C (f3*U) e f)-c(p^(k-l)ej)y 



fi*ti) _ pti)(k — 1) 



u ni) - (n{ j) (k - 1) u - 1)) 

This concludes the proof of the lemma. 



> 



□ 



Lemma 8 (General Error Bound). For any $j \in \{1, 2, \ldots, r\}$, the first time the algorithm reaches a (column) 
support size of $s_j$ at the beginning of the forward step, we have 



P*ti) _ p(i)(k — 1) 



A 

2 cii,-, 



u nj) u - 1) u n b (k - 1)) 

2p 



Cry 



(nt U) u fi*) - (n^ ; (jfe - 1) u n b (k - 1)) 



(fc) 
/is e. 



Proof. The proof is identical to the proof of Lemma 4 and is omitted. 



□ 

Lemma 9 (General Backward Step). For any $j \in \{1, 2, \ldots, r\}$, the first time the algorithm reaches a (column) 
support size of Sj at the beginning of the forward step, if Sj > s* + 2rp ^ ~ x \ then 



(Q { s 3) (k - 1) u n b (k - 1)) - u n*) 

pC m in \fr 



fP-m - .(« (*-!) 

^(ni 3, (*-i)un 6 (fc-i))-(n; l:,, unf) v 



PV2(P 2 -1) 



A 



(fc) 



Proof. Under the assumption of the lemma, the immediately preceding backward step has not gone through 
and hence, 

inf C (f}(k) - $ j \k)eiej) - C (f}(k)) > vp {k) e 
(i,j)en s (k-i) v J J \ J 



mi 

m£Q b (k — l) 



C (j3(k) - e m m (fc)) - C (f)(kj) > vp {k) we. 

Since the loss function is separable with respect to the columns of $\beta$, for a fixed $j \in \{1, 2, \ldots, r\}$, we have 

inf c(P\k)eJ-^\k)eiej)-C(P\k)ej) >^e. 
i-.(i,j)en s (k-i) v J J \ J J 

Consequently, similar to (6), we can show that 



- 1) u h b (k - 1)) - U Q* b ) 



(k) 



we 



< 



hi j \k - 1) - (si*w un* b u n b (k - 1)) 

(i) 



vp^e 



^-l)-(^U^) 



,2^2 
min 



^i J ' ) (/c-i)-(^: (j) uQ*ur2 b (/c-i))^^ 



^(Qi i) (/e-l)UQ 6 (/e-l))-(0* (j) UQ*)^ ^ 



- p 2 C^ in 



@h b (k-i 



b (k-i)-(n* s U) un* b ) 

2 



wvp^e 



2,oo 



where, A( fc ) = ^ ^ , m (fc) - - 1). This entails that 



(fti j) (/c - 1) u ft 6 (jfe - 1)) - (n: uj u ft*) 



0') i 



(fc) 



V 



pCminVr 





< 







(Qi j) (fc-l)Uf7 b (fc-l))-(f7* (j) UQ*) 



Thus, it suffices to show that ||A^^|| 2 < -^—y2(p 2 — l)p s e since /4 < Notice that by our assump- 
tion on the size of the support, the first term is always larger than the second provided we can show this in- 
equality. There are two cases: (a) if we added a single element in the previous step for which we show the above 



inequality, and (b) if we added a row in the previous step for which we show || || 2 < \j2(p 2 — 

Since ^ < e and p^ < p^ h \ the result follows. We prove (a) and omit the proof of (b) since it is identical. 



We drop the superscript and subscript $j$ for ease of notation in the rest of the proof. From the forward 
step, we have 



C 



(j8(fc - 1)) 



inf 



(j3(k - 1) 



+ / ye i e j 



(i,j)gn a (fc-i),7ei 

Let $(i^*, j^*, \gamma^* \neq 0)$ be the optimizer of the equation above. Now, we have 



2 

min 



A (fc) 



< 



^(^)fi.( fc -i)uS 6(fc -i))-^(^-l)) 

£ (^(*)n.(*-i)un 6(fc -i)) - ^ (m) + £ (m) - £ - !)) 



< n 2 r 2 



r 2 

w min 



A (fe) 



r 2 

w min 

Hence, \\A( k % < ^ $f* } (fc) 



and we only need to show that 



< 



2p- 



Since 



< 



/3^*\k) — 7* + 1 7* |, we can equivalently control the latter two terms. First, by forward step construction, 

Cmin l7*| 2 < £ (fi( k ~ !)) -£ (fi( k ~ !) + 7* e ** e J*) = Vs k) e and hence |7*| < Second, we claim that 

Pi{*\ k ) — 7* < (2p 2 — 1) |7*| and we are done. 



In contrary, suppose 



W;\ k ) - 7* > (V - 1) ' l7*| 2 > P 2 17* I"- We have 



C 2 



31f* } (fc)-7, 



>P 2 <^min 17*1 



> 



-7. ei .e£) -C(0(kj) 



> C 



> r* 2 

— min 



A (fe) 



+ c 2 



4- r 2 

^ IM i T I 



A (fe) 



+ c 2 



- C (fi(k - 1)) + C ((3{k - 1)) - C (p(k)) 
M!*\k) " 7*f + V^,,,)^ (d{k - 1)) - 7 *) 

W'\k) 



This is a contradiction provided C^ in /3-f* ) (fc) ^ + V^j^C (fi{k - 1)) ^•f* ) (A;) - 7*) > 0. Later in the 
proof, we will show that Sign \^{i,,j,)£> (ft{k — = —Sign (7*) and that 2C^ lin |7*| < W^^j^C \J3{k — 1)J 



2p 2 ^minl7*|. With these, if 

lows. Otherwise, we have 
hence, 



< 



< 1, we have V 



Mi'\k) > pg-\k) -|t*i= 31f* ) (fc)- 7at 



SO'.), 



> and the claim fol- 



so that 



3£°(A0 > 2p 2 | 7 *| and 



r 2 



! + V (i , (£(fc-l)) ($f->(fc)- 7 *) 
> 2p 2 C^ in | 7 *| |flj*>(*) - 7*| - 2p 2 C^ in | 7 ,| |%'->(*) - 7* 
= 0. 

To get the claimed properties of V^^j^C \J3(k — 1)J, note that 

Cl in |7*| 2 < ^ -!))-£ - 1) + 7-ei.eJ.) 

<-Cmin |7*| 2 -V (w ,)£(£(fc-l))7*, 

and hence Sign (v (i , )i , ) £ - 1))) = -Sign (7,) and 2C 2 in | 7 *| < |v (i . ,,-.)£ (j3(k - 1) 

P 2 Cl in |7*| 2 > C (p{k - 1)) - £ - 1) + 7 , ei .e£ 

>-f?cL> \7*\ 2 -Va,j,)£(Mk-i) 



establish 



Also, we can 



Since — V 



(^(fc-l)) 7, 
eludes the proof of the lemma. 



> 0, we can conclude that 



7 (u Jm )C(p(k-l)) 



< 2p 2 ^minl7*|. Thiscon- 

□ 



C Proof of Corollary 1 

The result follows from the following two lemmas. 






Lemma 10. Given the sample complexity $n_j \ge c_4 \log(rp)$ for some constant $c_4$ and all $j \in \{1, 2, \ldots, r\}$, we 
have 

j V n 

with probability at least $1 - c_6 \exp(-c_7 n)$ for some positive constants $c_5$, $c_6$ and $c_7$. 

The proof follows from Lemma 5 in [17]. We state our theoretical result in terms of $\lambda$ for the sake of 
generality. This parameter can be replaced with any upper bound on $\|\nabla\mathcal{L}(\beta^*)\|_\infty$ and our guarantee still holds. 

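Lemma 11. If each row of the design matrix $X^{(j)} \in \mathbb{R}^{n_j \times p}$ is distributed as $\mathcal{N}(0, \Sigma^{(j)})$ and $\Sigma^{(j)}$ satisfies 
REP($s_j$), then for any small $\theta > 0$, the matrix $Q^{(j)} = X^{(j)} X^{(j)T}$ satisfies 

$$(1 - \theta) C_{\min} \|\delta\|_2 \le \|Q^{(j)} \delta\|_2 \le (1 + \theta) \rho C_{\min} \|\delta\|_2, \qquad (7)$$

for all $\|\delta\|_0 \le s_j$, with probability $1 - c_8 \exp(-c_9 n)$ provided that $n_j \ge c_{10}(\theta)\, s_j \log(p)$, where $c_8$ to $c_{10}$ are 
constants independent of $(n_j, s_j, p)$. 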

The proof follows from Lemma 9 (Appendix K) in [17]. This lemma shows that for Gaussian design 
matrices, REP($s_j$) is satisfied with high probability for $O(s_j \log(p))$ samples. 